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Abstract 

We develop a mixture-based approach to ro- 
bust density modeling and outlier detection 
for experimental multivariate data that in- 
cludes measurement error information. Our 
model is designed to infer atypical measure- 
ments that are not due to errors, aiming to 
retrieve potentially interesting peculiar ob- 
jects. Since exact inference is not possible 
in this model, we develop a tree-structured 
variational EM solution. This compares fa- 
vorably against a fully factorial approxima- 
tion scheme, approaching the accuracy of a 
Markov-Chain-EM, while maintaining com- 
putational simplicity. We demonstrate the 
benefits of including measurement errors in 
the model, in terms of improved outlier de- 
tection rates in varying measurement uncer- 
tainty conditions. We then use this approach 
in detecting peculiar quasars from an astro- 
physical survey, given photometric measure- 
ments with errors. 



1. Introduction 

The goal in robust unsupervised data modeling is to 
capture the structure of the typical observations while 
dealing with atypical or outlying observations in an 
automated manner. Outliers can occur for various 
reasons, such as measurement errors or the existence 
of peculiar objects in a data set. If atypical obser- 
vations exist and are not properly dealt with, they 
lead to biases in the parameter estimates and poor 
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generalization of the structure inferred from the data. 
Therefore, a great deal of effort has been invested 
into modifying existing unsupervised methods to pro- 
vide them with robustness properties. In the statis- 
tics and statistical machine learning communities, the 
Student t-distribution was put forth and adopted as a 
robust building block, for clustering (Peel & McLach- 
lan, 2000; Svensen & Bishop, 2005), visuahzation (Vel- 
lido et al., 2005) and robust projections (Archambeau 
et al., 2006). The t-distribution has heavy tails, hence 
it gives non-zero probability to observations that are 
far away from the bulk of the density. 

Apart from the issue of robustness of the parameter 
estimates, the ability of detecting outliers is of spe- 
cial interest in certain scientific areas such as in As- 
trophysics (Djorgovski et al., 2001), where finding pe- 
culiar objects from large archives of multi- wavelength 
astronomical images provide a unique means of identi- 
fying candidates of possibly new types of objects that 
deserve more detailed follow-up study (e.g. using spec- 
troscopy). However, a bottleneck already anticipated 
in Djorgovski et al. (2001) is 'the likely overabun- 
dance of interesting objects found' - the interpretation 
and understanding of which will necessitate costly de- 
tailed analysis. Indeed, not every atypical observation 
is truly interesting. One reason for this lies in measure- 
ment errors resulting from uncertainties in instrument 
calibration and physical limitations of devices and ex- 
perimental conditions. These errors are typically care- 
fully recorded in the case of scientific data and are 
available. Yet, most existing data analysis methods 
have no natural ways of taking these into considera- 
tion. In turn, neglecting the error information holds 
the risk of compromising the accuracy with which gen- 
uine outliers can be detected, since there is nothing 
to prevent us from confusing erroneous measurements 
with potentially interesting rare or peculiar ones. 
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In classical statistics, models known as 'errors in vari- 
ables' exist, such as the total least square approach 
for robust regression (HufFel & Lemmerling, 2002). 
Probabilistic approaches able to propagate uncertainty 
have also started to appear recently (Liu et al., 2006; 
Girard et al., 2003) for certain problems, and their 
benefits have been convincingly demonstrated. How- 
ever we are aware of no work on including knowledge 
of observational errors specifically for unsupervised ro- 
bust density modeling. Due to the importance of this 
issue in scientific data mining, this paper makes an 
attempt to fill in this gap. 

2. Robust mixtures for data with errors 

Consider a data set in which each individual measure- 



ment is an estimate of the form ti 



where 



n = l,...,N,i = l,...,d, N is the number of object 
instances and d is the number of features. It is con- 
ceptually justified to assume that the error associated 
with these individual measurements is normally dis- 
tributed (Taylor, 1996). Organizing the square of er- 
rors into diagonal matrices S„, for each measured d- 
dimcnsional data point t„ , the following heteroscedas- 
tic noise model can be written. 



p(tn|w„) = Af{t„\w„,Sn); 



(1) 



where A/'(t„|w„, S„) denotes the normal distribution 
with unknown mean w„ and known diagonal covari- 
ance matrix S„. 

The unknown mean values w„ represent the clean, 
error-free version of the data. Since these cannot be 
observed directly, we will treat them as latent vari- 
ables. The genuine outliers, which we are interested 
in, must be those of the density of w rather than those 
of the density of t. We will therefore model the hid- 
den clean density as a robust mixture of Student i- 
distributions (MoT)^: 



(2) 



where p(w|fc) is the Student t-density. By the use of 
t-densities, we make no assumptions on the distribu- 
tion of outliers. Outliers are instances outside the high 
density 'cluster' regions. 

St(w|/Jfe,Sfc,J/fc) = ^ ^ — r3— 



1 + 



Gaussian with a Gamma placed on its precisions, 

oo 

5«(w|M,E,z/) = |AA(wK^)a(M|^,^)rf«; (3) 



where Q is the Gamma density, Q{u\a,b) = 
^a^a-i exp(-bM) ^ rj^^^^ re- Writing has been exploited for 
developing an exact ML estimation algorithm for the 
MoT model (Peel & McLachlan, 2000). 

In our model, the distribution of the observed data t 
can be obtained by integration over w. According to 
Eqs. (1), (2) and (3), we have: 

(4) 

Thus, given a set of training data y = (ti,-- - jt^r), 
the complete probability model of the observed vari- 
able t and the latent variables w, u, z will have the 
following factorized form: 

JCc = ]^ ]^[p(t™|Wn)p(w„|M„, = A;)]^^^"-" 



.k) 



]^]^[p(M„|«n = k)p{zn = k\n)y 



(z„=fe) 



(5) 



where 5{-) is the Kronecker delta. The plate diagram 

representation of this model is shown on the right-hand 
plot of Fig. 1, along with that of the MoT model. 

[This figure is available upon request from author] 
Figure 1. Plate diagrams of MoT (left), and the proposed 
model (right). 

3. A structured variational EM solution 

Since the integration in Eq. (4) is not tractable, we 
develop a generalized EM (GEM) algorithm (see e.g. 
Hogg et al. (2005)), with approximate E-step. In gen- 
eral terms, for each data point t„, its log- likelihood 
can be written as follows, for any distribution q: 



logp(t„|6l) 



q{hn) p(hn\tn,e) 



I q{hn) log ^liii^d/, = ^(t„|g, e) 



q{h„) 



where q is the free-form variational family (or varia- 
tional posterior), T is called the variational free en- 
ergy function, /i„ is the set of latent variables as- 
sociated with t„, and 6 is the set of parameters of 
the model. In our case, /i„ = (z„,w„,u„) and 6 — 
({Mfe}) {^ft}) {i^fc}) "■)• The log-likehhood of the given 
data set y is then lower bounded by the free energy: 

logp(3^|^) = ^logp(t„|^) > Y,^i^n\q{K),e) (6) 



As noted in Liu and Rubin (1995), with the in- 
troduction of an auxiliary hidden variable u, the t- 
distribution can be re-written as a convolution of a 



'^The instance indices n will be dropped for convenience, 
whenever their presence is obvious from the context. 



In the E-step of the (A; -|- l)-th iteration of a GEM al- 
gorithm, we maximize w.r.t the variational distribu- 
tion q while fixing the parameters in the k-th iteration. 



g'=+i(/i„) = argmaxj^{t„\q,e''). 



(7) 
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In the M-step, we maximize Eq. (6) w.r.t the param- 
eters 6 to obtain the new parameter values ^'^+^: 



identically null function. We obtain the following: 



arg max 

9 ' ' 



„k+l 



,0). 



(8) 



3.1. Tree-structured variational distribution 

Some tractable form needs to be chosen for q. The 
most common choice is a fully factorial form (Jordan 
ct al., 1999). In our case, this would be q{'w,u,z) = 
q{w)q{u)q{z). In the context of robust mixtures, fully 
factorial variational posterior distributions have been 
employed in (Svensen & Bishop, 2005), though with a 
slightly different model specification. Let us observe, 
however, that under our model definitions, it is feasible 
to keep some of the posterior dependencies by choosing 
the following tree-structured variational distribution: 



g(w, u, z = k) = q{z = k)q{w\z = k)q{u\z = k) 

Structured variational distributions have been used 
previously in the context of various other latent vari- 
able models (Gcigcr & Meek, 2005; Bishop & Winn, 
2003) and have been found more accurate compared 
to the fully factorial choice. Yet, their use is still 
not as popular as it could be. In the following, we 
denote q{z = k) by q{k), q{yv\z = k) by q{yf\k) 
and q{u\z = k) by q{u\k). Also, expectations w.r.t. 
q(yf\z = k) will be denoted by (.)w|fc and similarly, 
those w.r.t. q{u\k) by and those w.r.t. the joint 

q{w,u\k) = q(w\k)q{u\k) by (.)w,«|fe- 

3.2. Deriving the GEM algorithm 

The free energy function J^{t\q, 9) can be evaluated as 

T{t\q, e) = Y, lik) [{logp(t, w, u, fc))w,„ifc] + H{q) 



q{w\k) 
q{u\k) 
q{k) 



Cxp(log [p(t|w)p(w|n, fc)])„|fc 

/ cxp(log [p(t|w)p(w|ii, fc)])„|fcdw 

exp(log \p{yf\u, fc)p(M|fc)])w|fc 
/exp(log [p(w|m, k)p{u\k)])^\kdu 

CXp(At,fc) 



(10) 

(11) 
(12) 



It can be seen that q{-w\k) and q{u\k) depend only on 
variables in their Markov blanket. However, the dis- 
tribution q{k) depends on all other variables in the 
graph. Conveniently, the quantities required for com- 
puting Eq. (12) will be available from the computa- 
tions that are needed for evaluating the free energy 
function — which in turn is useful for monitoring the 
convergence of the GEM iterations. 

Due to the conjugacy properties of the distributions 
we used, and after simplification, we now can obtain q 
analytically. Let us define: 



{u)u\i 



{iAu\, 



{w)fc = S.>^|fc ((M)„|fcSfc Vfc + s H) 



(13) 



(14) 



ak = — - — ; bk = (15) 



where 



Ck = ((w)fe - ^kf ^ ({w)fe - Mfe) + Tr (Sfe ^S^ifc) . 

(16) 

Then we have: 

g(w|fc) =AA(w|(w)fc,S^|fc); q{u\k)=g{u\ak,hk)- (17) 



where II{q) is the entropy of the variational distribu- 
tion: H{q) = -Efe9(fc) [(log(g(ttlfc)g(w|fc)g(fc)))„,„|fc]. 
Defining At,k = (logp(t, w, w, fe))w,u|fe - {^og q{u\k))u\k - 
(logg(w|fc))w|fc then we have: 

^(t|9, e) = Y, Q{k) [At.k - log q{k)] . (9) 

k 



3.2.1. Variational E-step 

Now, in order to find the optimal functional form of the 
posterior distribution terms, we take functional deriva- 
tives of J^{t\q,0) w.r.t. the terms of q, i.e. q{w\k), 
q{u\k) and q{k) respectively, and equate these to the 



3.2.2. The variational likelihood bound 
At,k can be evaluated as follows: 

At,k = (logp(t|w))„|i. -I- (logp(M|fc))„|fc -I- logTTfc -I- 
(logp(w|«,fc))„,„|fc - (logg(w|fe))w|fc - 
(logg(-«|fc))„|fe 
= Q1+Q2 + Q3 + Q4 + QS + Q6 (18) 

where 

Qi = -^log(2^)-ilog|S|-iTr(Sw|fcS-^) 
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-^[((w)fc-tfS-i({w)fe-t)] 
Q, = (_-l)(logu)„|fc--- 

+ Ylog(y) -logr(y); 
Q3 = -^log(27r)-llog|S,| + ^J 



Ok 
_ 1 Ofe 

2 6fc 

logTTfe; 



Tr(S^lfcS^i) 



Q4 

Qr, = ^ + ^log(27r) + ilog|S„|fe|; 



Qe 



-[(ofc - l){\ogu)u\k +afelog6fc - fflfc - logr(afe)]; 



and where (logM)^^. = ■^(ttfe) — logbk and V(') is the 
di-gamma function. 

In summary, given a data 3^, the log hkelihood bound 
is computed cf. Eq. (9) as the following: 

JF = ^ ^ q{zn = k) [At^,k - log q{zn = k)] (19) 



where At^^k is computed as in Eq. (18) for each data 
point t„. Eq. (19) is useful to monitoring the conver- 
gence. 

3.2.3. M-STEP 

The parameter re-estimates are obtained by solving 
the stationary equations of w.r.t /ifc, Sfc and Wk, 
which yields: 



I" A; 



l^n = l 'ly^r. = k){u„)u„\k 



Etl?(^n = fc) 

1 

- Qi^n = k) 



(21) 
(22) 



where = [(^tfc - {w„>fc)(^ifc - {w„)fc)'^ + S^^l^] . 

Finally, I'k is re-estimated by solving the following non- 
linear equation. 

Y,l{Zn = k)[\0g{^) + 1 + (logU„), - ^ - V(^)] = 0. 



3.3. Scaling 

Considering the time complexity of the algorithm, per 
iteration, computing the posterior mean and covari- 
ance ((w)^ and Sw|fc) for each data point t takes 
0{(fK) operations. The computation of the parame- 
ters of q{u\k), ak and bk take 0{d^K), and the respon- 
sibility q{k) needs 0{SK) time as well. In total, this is 



0{d^KN). For comparison, the maximum likelihood 
estimation of MoT (Peel & McLachlan, 2000) takes 
0{<PK) to compute p(u|A:, t) and 0{d?K) to compute 
p{k\t), which totals a complexity of 0{(PKN) - same 
as that of proposed algorithm. Moreover, using a full 
factorial approximation in our model also results in the 
same theoretical complexity per iteration. So the only 
extra burden of our proposed method is the computa- 
tion of the posterior mean and covariance of the clean 
data w. The most expensive operation appears to be 
the matrix inversion, however, it should be noted, this 
is only required when are modeled as a full covari- 
ances, which is feasible in relatively low-dimensional 
problems (d <C -/V). If this model was to be used on 
high dimensional data, then a diagonal form would 
need to be taken — in which case the cubic operation 
is no longer required since the matrices to be inverted 
become diagonal. 

3.4. Accommodating new data points 

Since the model is fully generative, it can also be ap- 
plied to new, previously unseen data from the same 
source. For a given test data set, we need to calculate 
the posterior distributions of w„ and w„ associated 
with each test point t„. To calculate these, we fix the 
parameters /x^, Sfc and tt^, 1 < k < K obtained from 
the training set and perform the E-step iterations until 
convergence. This typically converges at least an order 
of magnitude faster than the full training procedure. 

3.5. Determining the number of components 

To determine the number of mixture components, the 
minimum message length (MML) criterion (Figueiredo 
& Jain, 2002) could be employed. We derive a lower 
bound to MML, so the optimal number of components 
is found by maximizing the following criterion: 



my) = 



-I E 

fc:7rfc>0 

k„z{n+ 1) 



NTTk \ _ knz 

12 J 2 

+ iogp{y\e) 



log 



(23) 



where p{y\0) is the data log-likelihood, n is the di- 
mensionality of the parameters, knz is the number of 
non-zero-probability components. The free parame- 
ters involved in the proposed algorithm are the means 
and the full covariance matrices of jV(w„|u„, z„ = k). 
Thus the dimensionality of the fc-th parameter 9^ = 
{jjik, is d + d{d — l)/2. We approximate the data 
likelihood as earlier, by Eq. (6). Replacing this in 
(23), leads to maximizing: 



my) > E iog(^)-^iog^^ 



12 



'^^^^^^ + J2^{t„\q,0)^ciy\<i,e). 
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This maximization is similar to the GEM presented in 
Section 3.1 and algorithmically the only difference is in 
computing the mixing proportions TTfe in the M-steps, 
which is now: 



{0,Ellg(^n=fc)-i} 

aax{0,Elig(^n=j)- t} 



(24) 



Of course, only the non-zero-probability components 
of the mixtures will contribute to g(w„|z„), q{un\zn) 
and q{zn). 

4. The Outlier Detection Criteria 

Since we modelled the clean, crror-frcc data by a mix- 
ture of t-distributions, wc would expect that the model 
can find outliers w.r.t the clean data, rather than 
the contaminated data. Following Peel and McLach- 
lan (2000), the posterior expectation of u is inter- 
pretable as an outlierness indicator. Using our pos- 
terior approximations described earlier (i.e. q{u, k) = 
q(u\k)q{k) to approximate p{u,k\t)), then the varia- 
tional expectation of u will be employed to infer out- 
lierness. This is: 



e = ^g(fc) 



i^k + 1 



z/fc + Tr(E;^iS^|fc) + A2 ^ 



(25) 



where Uk and bk are defined in Eq. (15); and A^n. = 

((w)w|fc-Mfc)^S;:^((w)„|fc-^fc). Therefore, a data point 
is considered to be an outlier if its corresponding e 
value is sufficiently small. 

In contrast, recall that for MoT, the outlier criterion 
value (Peel & McLachlan, 2000) is 



CMoT 



Vk+ d 



Vk + (t - ^fc)^Sj. (t - ^fc 



(26) 



So we see that rather than instead of the Mahalanobis 
distance between the mean and the data t as in 

Eq. (26), the distance between the center /ife and the 
expected value of the clean data w is present in (25). 

Further, it can easily be seen, for consistency, that 
in the limit of zero observation error, our outlierness 
criterion reduces to that of MoT. Indeed, whenever 
S = 0, Eqs. (13), (14) and (16) can be written as: 

Sw|fc = 0; {w)w|fc = t; 

Ck = (t-^iA:)'^S^^(t-//fc); 

Then replacing the above equations into Eqs. (11), 
(20), (21) and (22), we can recover the posteriors as: 

q{u\k) = p{u\t,k);q{k) = p{k\t); 

and so, the update formulas of the MoT are recov- 
ered (Peel & McLachlan, 2000). 

If the size of the measurement error S (we can mea- 
sure the size of S by its trace) is small, we expect 



the difference between e and cmoT is relatively small 
too. However, as the size of the measurement error 
gets larger, the difference between the two outlierness 
criteria becomes larger and consequently the ranking 
they produce will be different. In particular, we can 
gain more insights and see the effects of a misspecifi- 
cation of the error by rewriting the data likelihood (4) 
by integrating over w: 

P(tn) = E / ^ (t" I*"" 1^ + S") S ("I ^ . ^) dUn (27) 

The posterior expectations are data instance- 
specific, ensuring the robustness of the parameter esti- 
mates, even if the errors (diagonals of S„) are misspec- 
ified. However, this also implies that a data instance 
with an underestimated S„ gets picked as a false 'inter- 
esting' outlier ((u„) gets smaller). Clearly, if all errors 
are specified at zero, our model reduces to MoT and 
produces unwanted false detections. 



5. Experiments and results 

To test the performance of the proposed algorithm, 
first we experimentally assess the accuracy of the 
structured factorization employed. Second, we per- 
form a set of controlled experiments on synthetic and 
semi-synthetic data sets, in order to demonstrate the 
ability of defecting genuine outliers. Finally, we shall 
present a real application of our approach in astron- 
omy, for finding peculiar (high redshift) objects from 
the SDSS quasar catalogue (York, 2000). 

5.1. Synthetic data &: illustrative experiment 

A synthetic data set is constructed comprising of error- 
free values sampled from a mixture of three well sepa- 
rated Gaussians and a uniform distribution simulates 
the presence of genuine outliers. Then we add Gaus- 
sian noise to all points, to simulate measurement er- 
rors, and apply our algorithm to the resulting dataset. 
The aim is to recover the genuine outliers (along with 
the density of non-outliers) , despite the Gaussian noise 
added. The leftmost plot of Fig. 2 shows the error- 
free data, with the Gaussian covariances of the clusters 
of non-outliers superimposed. Different markers are 
used for points in different clusters and the outliers are 
marked with stars. The central plot shows the effect 
of simulating measurement errors. The marker sizes 
are proportional to the size of errors. Notice that due 
to the errors, some outliers appear closer to the main 
density regions while some of the non-outliers 'jump' 
away from the bulk of density. Thus the measurement 
errors make the problem of recovering genuine outliers 
much more challenging. The rightmost plots of Fig. 2 
shows the result of our estimation procedure described 
earlier, superimposed over the data with errors. 
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-10 10 -10 10 -10 10 

Figure 2. A synthetic data set with cluster structure and 
outliers. Left: Hidden error-free data with genuine out- 
liers; Center: Data contaminated with measurement errors; 
Right: The estimated grouping and detected outliers. 

5.2. Comparison of alternative approximate 
EM methods 

Now, we test the accuracy of the structured varia- 
tional EM method developed, against a fully facto- 
rial variational EM for the same model, and a Markov 
Chain EM (MCEM) reahzed through Gibbs sampling, 
the latter being considered to represent the 'ground 
truth'. Fig. 3 shows the approximation of the log 
likelihood against iterations in a run on the synthetic 
data set shown earlier. For Gibbs sampling MCEM, 
M = 10, 000 samples were used for computing the pos- 
terior estimates. The first 2000 samples were discarded 
as burn-in. All algorithms were started from the same 
initial parameter values. As expected, MCEM is supe- 



-7200 




■MCEM 

- The proposed algorithm 

- Full factorization 



20 40 60 80 100 120 140 160 180 200 
Iterations 

Figure 3. The optimization process of alternative approxi- 
mate EM algorithms on the synthetic data set. 



rior to variational methods, but at the price of a heavy 
computational demand and difficulties in determining 



its convergence. From the figures we also see the struc- 
tured variational EM is closer to MCEM than the fully 
factorial variational EM. Therefore we use this method 
in the remainder of experiments reported^ . 

5.3. Assessing the accuracy of detecting 
genuine outliers 

To see how well can we detect genuine outliers, we 
start by carrying out a set of controlled experiments, 
varying the extent of measurement errors. We use our 
synthetic data sets and define five different measure- 
ment error levels: The diagonal elements of the error 
variance matrix S will range between [0,0.01], [0,0.1], 
[0,1], [0,10] and [0,100] respectively. 

We perform receiver operating characteristics (ROC) 
analysis (Fawcett, 2004) to measure the performance. 
The area under the ROC curve (AUG) gives us the 
probability that a genuine outlier is detected. The 
MoT is employed as a baseline in our comparisons, 
in two instances: i) MoT applied to the clean data 
(which in real applications is not available) provides 
an idealized upper limit; ii) MoT applied to the data 
contaminated with observation errors provides a base- 
line against of which we measure our improvements. 
Fig. 4 summarizes the results obtained. For each of 
the 5 error conditions, the mean and standard devia- 
tion of the AUG values over repeated runs on 30 inde- 
pendent realizations of the data are shown: The up- 
per plot shows the in-sample performance whereas the 
lower plot shows the out-of-sample performance, i.e. 
the ability to detect genuine outliers in previously un- 
seen data from the same density model. The results 
are intuitive — we see a systematic and increasingly 
statistically significant improvement w.r.t. MoT/base, 
as the measurement uncertainty increases, both on is- 
sample data and on out-of-sample data. 

In order to test our method further on data with a 
more realistic underlying density, while still being able 
to evaluate the benefits of using measurement error in- 
formation in a controlled manner, we now apply our 
method on semi-synthetic data derived from real data, 
the lymphography data set (Blake & Merz, 1998). 
Originally, the data has four classes (148 data points 
in total and 18 dimensions), but two of them are quite 
small (2 and 4 data records), so we consider the two 
small classes as outliers. We added heteroscedastic 



^We also tested the full- factorial version, as well as the 
possibility of obtaining maximum a posteriori estimates for 
u„ by conjugate gradients optimisation. Both have been 
found inferior to the structured variational EM approach, 
on the synthetic data sets tested, both in terms of their 
accuracy of detecting genuine outliers, and their clustering 
accuracy rates evaluated against the true cluster labels. 
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Figure 4. Comparison of MoT/idealized, MoT/base and 
the proposed algorithm on data sets with different levels 
of measurement error: in-sample (left) and out-of-sample 
(right). 

Gaussian noise with variances ranging between 0-0.1, 
to each observation, in order to simulate errors. 

The algorithms were run on 10 independent realiza- 
tions of the measurement errors and computed the 
average ROC curves (Fawcett, 2004) and associated 
average AUG. The in-sample (93 data points) aver- 
age AUG obtained by MoT/idealized is 0.9391, by 
MoT/base it is 0.9005, whereas the proposed algo- 
rithm obtained 0.9555. The significance values of a 
t-test between MoT/idealized and the proposed algo- 
rithm has been 0.39, while the value between the pro- 
posed algorithm and MoT/base was 9.6 x 10"^. This 
suggests that the proposed algorithm performs compa- 
rably to MoT/idealized and significantly better than 
MoT/base in this experiment. Moreover, the out of 
sample (55 data points) performance has also been of 
the same quality and this is shown in Fig. 5. We can 
conclude therefore, that knowledge of measurement er- 
rors is useful and can be exploited with the use of our 
approach to achieve a more accurate detection of gen- 
uine outliers. 

5.4. Application to Detecting high-redshift 
quasars from the SDSS quasar catalogue 

In astrophysical measurements, there is no error-free 
situation (Taylor, 1996) since the objets observed are 
too far away. The measurement errors are known for 
each feature and each object, though the error-free 
data is not accessible. Unlike in the previous sec- 
tions, a validation against an absolute ground truth 
is therefore no longer possible. Nevertheless, the data 
set analyzed here is well-studied in astrophysics, the 
SDSS quasar catalog (York, 2000), which provides five 
magnitudes for a large number of quasars, represent- 
ing their brightness measured with five different fil- 
ters u' g' r' i' and z'. From these, to avoid bias with 
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Figure 5. Out of sample average ROC curves of 
MoT/idealized, MoT/base and the proposed algo- 
rithm on the lymphography data set. The error bars 
represent one standard deviation. 



brightness, we construct four features, each related to 
a color, by subtracting r' (which is the most reliably 
measured) from each of the others. In addition, spec- 
troscopic redshift estimates are available — these are 
not used within the algorithm, but are useful to de- 
rive a way of validating our results. The redshift is re- 
lated to the distance of the object from the Earth, and 
very distant objects are rare. Given that with higher 
redshift, the entire spectral pattern is systematically 
shifted towards one end, there is physical reason for 
high redshift quasars to be perceived as outliers in the 
overall density of quasars in the color space. This ob- 
servation has been exploited in a number of previous 
studies for finding high redshift quasars in various 2D 
projections of the data (Fan, 2006). However, a com- 
prehensive approach which both i) works in the multi- 
variate feature space and ii) takes principled account 
of the measurement errors has not been available. 

We apply our method to a sample of 10,000 quasars 
and compute the AUG values against a varying red- 
shift threshold. The resulting relationship is shown in 
Fig. 6, for different choices of K. The optimal order 
determined by MML was K — 2, nevertheless, from 
the figure we see a remarkable robustness w.r.t. this 
choice. The y-coordinate of each point on these curves 
indicates the probability of detecting quasars of red- 
shift greater than its x-coordinate. This plot shows 
clearly that our principled method in four-color space, 
using errors, can identify as outliers an overwhelm- 
ing fraction of quasars already at a redshift of 2.5 (or 
higher), whereas the 2D projection methods, e.g. Fan 
(2006), can manage only those with z>3.5, which are 
extremely rare, and obvious from naive projections. 
By being able to identify the latter category, when 
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the SDSS galaxy catalogue is complete with four-color 
magnitudes, our method promises to retrieve an order 
of magnitude more interesting high-redshift quasars 
than existing methods would. 




Redshift 

Figure 6. AUG versus possible redshift thresholds. 



6. Conclusions 

We proposed a robust mixture model for multivari- 
ate data that includes error information. This was 
achieved by employing composite densities, designed 
to infer peculiarity that is not due to errors. We de- 
rived a structured variational EM algorithm for in- 
ference and parameter estimation, which in the zero 
limit of the measurement errors reduces to maximum 
likelihood estimation of t-mixtures. Assessment of the 
variational scheme employed has shown it to be closer 
to the 'ground truth' MCEM than a fully factorial 
approximation scheme. Empirical results of a set of 
controlled experiments have shown a systematic and 
statistically significant improvement in terms of cor- 
rect outlier detection rates in high measurement uncer- 
tainty conditions, as a result of appropriately incorpo- 
rating knowledge about the measurement errors. Fi- 
nally, a real application of our method to detecting pe- 
culiar, high-redshift quasars from the SDSS photomet- 
ric quasar catalogue was demonstrated. Further work 
may concern extensions to robust projection models 
(Archambeau et al., 2006) and investigating ways of 
including an interactive visual element into the analy- 
sis of outliers for data with error information. 
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